Take-home_Ex01

Author

Wan Xinyu

1. The task

The purpose of this take home exercise is to to reveal the demographic and financial characteristics of the city of Engagement, using appropriate static and interactive statistical graphics methods. This exercise requires a user-friendly and interactive solution that helps city managers and planners to explore the complex data in an engaging way and reveal hidden patterns.

2. Data preparation

2.1 Installing the data packages

pacman::p_load(ggplot2, ggiraph, plotly, 
               patchwork, DT, tidyverse,
               ggrepel, ggthemes, hrbrthemes,
               tidyverse,ggstatsplot,pals,readxl, 
               performance, parameters, see) 

ggplot2: A data visualization package for creating graphics in R. It provides a flexible and powerful framework for creating elegant and complex graphs. Documentation: https://ggplot2.tidyverse.org/

ggiraph: An extension to ggplot2 that allows you to add interactivity to your ggplot2 graphics. It provides a set of functions that enable you to create interactive tooltips and clickable regions on your plots. Documentation: https://davidgohel.github.io/ggiraph/index.html

plotly: A data visualization package that allows you to create interactive, web-based plots in R. It provides a wide range of chart types, including scatter plots, line charts, and heatmaps. Documentation: https://plotly.com/r/

patchwork: A package for combining multiple ggplots into a single plot. It provides a flexible and powerful framework for arranging, annotating, and styling your plots. Documentation: https://patchwork.data-imaginist.com/

DT: A package for creating interactive data tables in R. It provides a wide range of features, including sorting, filtering, and pagination. Documentation: https://rstudio.github.io/DT/

tidyverse: A collection of packages for data manipulation and visualization in R. It provides a consistent set of tools for working with data, including data cleaning, transformation, and visualization. Documentation: https://www.tidyverse.org/

ggrepel: A package for avoiding overplotting of text labels in ggplot2. It provides a set of functions for positioning labels in a way that minimizes overlap and maximizes readability. Documentation: https://ggrepel.slowkow.com/

ggthemes: A package for creating visually appealing themes for ggplot2 graphics. It provides a set of pre-designed themes that can be used to customize the appearance of your plots. Documentation: https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/

hrbrthemes: A package for creating visually appealing themes for ggplot2 graphics. It provides a set of pre-designed themes that are designed to be aesthetically pleasing and easy to read. Documentation: https://hrbrmstr.github.io/hrbrthemes/

ggstatsplot: A package for creating publication-ready plots with statistical analysis built-in. It provides a set of functions for creating common statistical plots, including scatterplots, boxplots, and histograms, with statistical tests and summary information included. Documentation: https://indrajeetpatil.github.io/ggstatsplot/

pals: A package for creating color palettes for data visualization. It provides a set of functions for generating color palettes based on various themes and color schemes. Documentation: https://cran.r-project.org/web/packages/pals/vignettes/pals_examples.html

readxl: A package for reading Excel files in R. It provides a set of functions for importing data from Excel spreadsheets into R data frames. Documentation: https://readxl.tidyverse.org/

performance: A package for evaluating the performance of predictive models in R. It provides a set of functions for creating various types of performance metrics and visualizations. Documentation: https://easystats.github.io/performance/

parameters: A package for managing parameters and arguments in R. It provides a set of functions for defining, validating, and documenting parameters in R functions. Documentation: https://www.rdocumentation.org/packages/parameters/versions/0.21.0

see: A package for visualizing the output of R functions. It provides a set of functions for creating visual summaries of data frames, matrices, and other R objects. Documentation: https://cran.r-project.org/web/packages/see/index.html

2.2 Dataset introduction

The unzipped files have been saved into a new folder named data for better organization.

part_info <- read_csv("data/Participants.csv")

Participants.csv

Contains information about the residents of City of Engagement that have agreed to participate in this study.

  • participantId (integer): unique ID assigned to each participant.
  • householdSize (integer): the number of people in the participant’s household
  • haveKids (boolean): whether there are children living in the participant’s household.
  • age (integer): participant’s age in years at the start of the study.
  • educationLevel (string factor): the participant’s education level, one of: {“Low”, “HighSchoolOrCollege”, “Bachelors”, “Graduate”}
  • interestGroup (char): a char representing the participant’s stated primary interest group, one of {“A”, “B”, “C”, “D”, “E”, “F”, “G”, “H”, “I”, “J”}. Note: specific topics of interest have been redacted to avoid bias.
  • joviality (float): a value ranging from [0,1] indicating the participant’s overall happiness level at the start of the study.
finance <- read_csv("data/FinancialJournal.csv")

FinancialJournal.csv

Contains information about financial transactions.

  • participantId (integer): unique ID corresponding to the participant affected
  • timestamp (datetime): the time when the check-in was logged
  • category (string factor): a string describing the expense category, one of {“Education”, “Food”, “Recreation”, “RentAdjustment”, “Shelter”, “Wage”}
  • amount (double): the amount of the transaction For explanation of Rent Adjustment, please refer to this link

2.3 Examining the datasets

Participant csv

Lets first examine the properties of the participants csv file.

datatable(part_info)
str(part_info)
spc_tbl_ [1,011 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ participantId : num [1:1011] 0 1 2 3 4 5 6 7 8 9 ...
 $ householdSize : num [1:1011] 3 3 3 3 3 3 3 3 3 3 ...
 $ haveKids      : logi [1:1011] TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ age           : num [1:1011] 36 25 35 21 43 32 26 27 20 35 ...
 $ educationLevel: chr [1:1011] "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" ...
 $ interestGroup : chr [1:1011] "H" "B" "A" "I" ...
 $ joviality     : num [1:1011] 0.00163 0.32809 0.39347 0.13806 0.8574 ...
 - attr(*, "spec")=
  .. cols(
  ..   participantId = col_double(),
  ..   householdSize = col_double(),
  ..   haveKids = col_logical(),
  ..   age = col_double(),
  ..   educationLevel = col_character(),
  ..   interestGroup = col_character(),
  ..   joviality = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Removed participant Id, education level and interestgroup from the following since summary are not meaningful

summary(part_info[, c("householdSize", "haveKids", "age","joviality")])
 householdSize    haveKids            age          joviality       
 Min.   :1.000   Mode :logical   Min.   :18.00   Min.   :0.000204  
 1st Qu.:1.000   FALSE:710       1st Qu.:29.00   1st Qu.:0.240074  
 Median :2.000   TRUE :301       Median :39.00   Median :0.477539  
 Mean   :1.964                   Mean   :39.07   Mean   :0.493794  
 3rd Qu.:3.000                   3rd Qu.:50.00   3rd Qu.:0.746819  
 Max.   :3.000                   Max.   :60.00   Max.   :0.999234  

Count of resident for each interest group

table(part_info$interestGroup)

  A   B   C   D   E   F   G   H   I   J 
102  91 102  96  83 106 108 111  96 116 

Count of resident for each education level

table(part_info$educationLevel)

          Bachelors            Graduate HighSchoolOrCollege                 Low 
                232                 170                 525                  84 
sum(is.na(part_info))
[1] 0
duplicates1 <- duplicated(part_info)
part_info[duplicates1, ]
# A tibble: 0 × 7
# ℹ 7 variables: participantId <dbl>, householdSize <dbl>, haveKids <lgl>,
#   age <dbl>, educationLevel <chr>, interestGroup <chr>, joviality <dbl>

Since 0 rows are displayed. We can confirm that duplicates do not exist in this csv file.

Finance csv

Lets move on to examine the properties of the finance csv file.

Sneak peak of the first few entries in the dataset

head(finance)
# A tibble: 6 × 4
  participantId timestamp           category  amount
          <dbl> <dttm>              <chr>      <dbl>
1             0 2022-03-01 00:00:00 Wage      2473. 
2             0 2022-03-01 00:00:00 Shelter   -555. 
3             0 2022-03-01 00:00:00 Education  -38.0
4             1 2022-03-01 00:00:00 Wage      2047. 
5             1 2022-03-01 00:00:00 Shelter   -555. 
6             1 2022-03-01 00:00:00 Education  -38.0
Warning

Please do not use datatable here or you will face a error of ‘Fatal javascript OOM in reached Heap Limit’ which indicate that R studio session has run out of memory while attempting to execute JavaScript code.

str(finance)
spc_tbl_ [1,513,636 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ participantId: num [1:1513636] 0 0 0 1 1 1 2 2 2 3 ...
 $ timestamp    : POSIXct[1:1513636], format: "2022-03-01" "2022-03-01" ...
 $ category     : chr [1:1513636] "Wage" "Shelter" "Education" "Wage" ...
 $ amount       : num [1:1513636] 2473 -555 -38 2047 -555 ...
 - attr(*, "spec")=
  .. cols(
  ..   participantId = col_double(),
  ..   timestamp = col_datetime(format = ""),
  ..   category = col_character(),
  ..   amount = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Only ‘amount’ variable summary is displayed since summary of other variables are not useful

summary(finance[c("amount")])
     amount         
 Min.   :-1562.726  
 1st Qu.:   -5.594  
 Median :   -4.000  
 Mean   :   20.047  
 3rd Qu.:   21.598  
 Max.   : 4096.526  
table(finance$category)

     Education           Food     Recreation RentAdjustment        Shelter 
          3319         790051         296013            131          11463 
          Wage 
        412659 
sum(is.na(finance))
[1] 0

There is ni missing values

Check for outlier in the amount variable. We first group the amount variables by the category. Then we do a box plot. From the chart we can observe that shelter has some abnormally small values to the negative end and wages has some exceptionally large values on the positive end. We may wish to take note of these in our analysis.

Show the code
# Create a box plot of amount by category
ggplotly(ggplot(finance, aes(x = category, y = amount, fill = category)) +
  geom_boxplot() +
  xlab("Expense Category") +
  ylab("Amount") +
  ggtitle("Amount Spent by Expense Category"))

Now lets move to check for duplicates. The code below will display the duplicates in the financial_journal.csv

duplicates <- duplicated(finance)
finance[duplicates, ]
# A tibble: 1,113 × 4
   participantId timestamp           category   amount
           <dbl> <dttm>              <chr>       <dbl>
 1             0 2022-03-01 00:00:00 Shelter    -555. 
 2             0 2022-03-01 00:00:00 Education   -38.0
 3             1 2022-03-01 00:00:00 Shelter    -555. 
 4             1 2022-03-01 00:00:00 Education   -38.0
 5             2 2022-03-01 00:00:00 Shelter    -557. 
 6             2 2022-03-01 00:00:00 Education   -12.8
 7             3 2022-03-01 00:00:00 Shelter    -555. 
 8             3 2022-03-01 00:00:00 Education   -38.0
 9             4 2022-03-01 00:00:00 Shelter   -1556. 
10             4 2022-03-01 00:00:00 Education   -12.8
# ℹ 1,103 more rows

We will remove these duplicates and resave the file as finance using the following code

finance <- distinct(finance)

Lets just confirm that the duplicates are removed using the following code. The output of the following should be the same.

nrow(finance)
[1] 1512523
nrow(distinct(finance))
[1] 1512523
Warning

For better data presentation and consistency, we may encode all expenses into positive instead of negative using the code below. However, we will not do this as we will will analysis rent adjustments which has positive and negative and may cause confusion

#finance$amount <- abs(finance$amount)

2.4 Checking for anomalies in datasets

Show the code
# create a datatable of participant ID and number of months they made transaction in
finance %>%
  mutate(date = as.Date(timestamp)) %>%
  group_by(participantId) %>%
  summarise(num_months = n_distinct(format(date, "%Y-%m"))) %>%
  datatable()
Note

Clicking from the above table, we do realize that most participant id has transaction in all 12 months but some only had transactions in one month. We should remove them as it is likely they are not residents

Using the code below, we built a table and identify residents with irregular transactions

Show the code
finance %>%
  mutate(date = as.Date(timestamp)) %>%
  group_by(participantId) %>%
  summarise(num_months = n_distinct(format(date, "%Y-%m"))) %>%
  filter(num_months != max(num_months)) %>%
  datatable()

We will move on to classify the above identified participants as non residents and remove them from the data sets

Removing non-residents from finance dataset

# calculate num_months for each participant
monthly_counts <- finance %>%
  mutate(date = as.Date(timestamp)) %>%
  group_by(participantId) %>%
  summarize(num_months = n_distinct(format(date, "%Y-%m")))

# find participants with num_months different from the maximum num_months
non_residents <- monthly_counts %>%
  filter(num_months != max(num_months))

# remove non-residents from the finance data frame
finance <- finance %>%
  filter(!participantId %in% non_residents$participantId)

A reduction in rows is observed below indicating removal of 131 residents.

# Get number of rows for finance data
nrow(distinct(finance))
[1] 1509897

Removing non-residents from participant dataset

We will also update out participant csv to remove the non residents

# remove non-residents from the finance data frame
part_info <- part_info %>%
  filter(!participantId %in% non_residents$participantId)

We can see a reduction in 131 rows as well

# Get number of rows for participant data
nrow(distinct(part_info))
[1] 880

3. Data visualization

Lets now visualize the data provided in the data set participants.csv

Show the code
# Histogram of age
v1<- ggplot(part_info, aes(x = age)) +
  geom_bar(binwidth = 5) +
  labs(title = "Distribution of Participant Age",
       x = "Age (years)",
       y = "Count")
ggplotly(v1)
Show the code
# Bar chart of education level
v2<- ggplot(part_info, aes(x = educationLevel)) +
  geom_bar() +
  labs(title = "Education Level of Participants",
       x = "Education Level",
       y = "Count")
ggplotly(v2)
Show the code
# Bar chart of household size
v3<- ggplot(part_info, aes(x = householdSize)) +
  geom_bar() +
  labs(title = "Household Size of Participants",
       x = "Household size",
       y = "Count")
ggplotly(v3)
Show the code
# Bar chart of whether participants have children
v4<- ggplot(part_info, aes(x = factor(haveKids))) +
  geom_bar() +
  labs(title = "Proportion of Participants with Children",
       x = "Have Children",
       y = "Count")
ggplotly(v4)
Show the code
# Bar chart of interest group distribution
v5 <- ggplot(data = part_info, aes(x = interestGroup)) +
      geom_bar(aes(text = paste("\n","Count: ", ..count.., "\n"," Percentage: ", scales::percent(..count../sum(..count..))))) +
      labs(title = "Interest Group Distribution", x = "Interest Group", y = "Count")

ggplotly(v5,tooltip = "text")
Show the code
# Histogram of joviality distribution
v6<- ggplot(part_info, aes(x = joviality)) +
  geom_histogram(binwidth = 0.05, fill = "grey", color = "white") +
  labs(title = "Joviality Distribution", x = "Joviality", y = "Count")
ggplotly(v6)

3.1 Exploring participants data

3.1.1 Education vs age

Show the code
v7<- ggplot(part_info, aes(x = age, fill = educationLevel)) +
  geom_bar() +
  labs(title = "Distribution of Education Level by age",
       x = "Age (years)",
       y = "Education Level")
v8<- ggplot(part_info, aes(x = age)) +
  geom_bar() +
  labs(title = "Age Distribution by Education Level",
       x = "Age (years)",
       y = "Count") +
  facet_wrap(~ educationLevel, ncol = 2) #wrapping the 4 charts together into 1

# Combining the 2 charts together for display
v7 + v8

From the charts, we can observe that across different age, high school educated residents are of the majority while low educated residents are of the minority.

Note

Added fig.height to make sure that the charts are not overly compressed

3.1.2 Household size vs have kids

Show the code
# Creating a jitter plot for household size and whether residents have children
ggplot(part_info, aes(x = haveKids, y = householdSize)) +
  geom_jitter() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Relationship between Household Size and Having Children",
       x = "Have Children",
       y = "Household Size")

From the chart, we observe a pattern that only household with 3 person have kids while those with 2 do not. This is useful because for the authorities, provison of subsidies such as milk and diaper vouchers can be directed specifically to family with more than 3 household members

3.1.3 Have kids vs education

Show the code
# We use this line to relabel the education level according to acceding order
edu_levels <- c("Low", "HighSchoolOrCollege", "Bachelors", "Graduate")

# convert education level column to our above specified levels
part_info$educationLevel <- factor(part_info$educationLevel, levels = edu_levels)

# Calculate percentage of each education level group with children
edu_kids <- part_info %>%
  group_by(educationLevel, haveKids) %>%
  summarise(count = n()) %>%
  mutate(percentage = count / sum(count))

# Plot the bar chart
ggplotly(ggplot(edu_kids, aes(x = educationLevel, y = percentage, fill = haveKids)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Percentage of Participants with Children by Education Level",
       x = "Education Level",
       y = "Percentage with Children") +
  scale_fill_manual(values = c("#E69F00", "#56B4E9"), labels = c("No", "Yes")))

From the above we can observe that across all education levels, percentage of residents who have kids are generally lower that than percentage of residents who do not have kids. There seems to be an inverse relationship between education level and if a resident has kids (with exception to high school where it is higher than low education residents).

Note

We specify the desired order of the education levels using the edu_levels vector. Then, we convert the education level column to a factor with the desired levels using the factor() function and the levels

3.1.4 Mean age vs having kids

Show the code
# Define a function to create a tooltip with information about the mean and standard error of a y-value
tooltip <- function(y, ymax, accuracy = .01) {
  mean <- scales::number(y, accuracy = accuracy)# Calculate the mean and standard error of the y-value and format them as strings
  sem <- scales::number(ymax - y, accuracy = accuracy)
  paste("Mean age:", mean, "+/-", sem)# Create a string that combines the mean and standard error
}
# Create a ggplot object with part_info as the data and haveKids as the x-variable
gg_point <- ggplot(data=part_info, 
                   aes(x = haveKids),
) +
  stat_summary(aes(y = age, 
                   tooltip = after_stat(  
                     tooltip(y, ymax))),# Add a summary statistic of the mean and standard error of age for each value of haveKids  
    fun.data = "mean_se", 
    geom = GeomInteractiveCol,  
    fill = "light blue"
  ) +
  stat_summary(aes(y = age),
    fun.data = mean_se,
    geom = "errorbar", width = 0.2, size = 0.2
  ) # Add error bars showing the standard error of the mean for each value of haveKids

# Create an interactive plot using girafe with gg_point as the input and a specified width and height
girafe(ggobj = gg_point,
       width_svg = 8,
       height_svg = 8*0.618)

From the chart above, we do not observe any significant difference.However we do note that mean age of residents who do not have kids are sightly lower than those who do have kids.

Let’s perform a statistical test to determine if there is a difference in the mean

Show the code
# Create a plot with ggbetweenstats that displays the distribution of age for each value of haveKids
ggbetweenstats(
  data = part_info,
  x = haveKids,
  y = age,
  plot.type = "box",
  pairwise.comparisons = TRUE,
  mean.plotting = TRUE,
   message = TRUE ,
  xlab = "Having Kids",
  ylab = "Age"
)

At a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean age of resident who have and do not have kids are the same

3.1.5 Percentage of Participants in Each Interest Group by Education Level

Show the code
# Define a character vector of education levels
edu_levels <- c("Low", "HighSchoolOrCollege", "Bachelors", "Graduate")

# Convert the "educationLevel" variable in the "part_info" data frame to a factor variable and specify the order of the levels
part_info$educationLevel <- factor(part_info$educationLevel, levels = edu_levels)

# Group the "part_info" data frame by education level and interest group, count the number of participants in each group, and calculate the percentage of participants in each group
grouped_data <- part_info %>%
  group_by(educationLevel, interestGroup) %>%
  summarise(count = n()) %>%
  mutate(percentage = prop.table(count) * 100)

# Create a bar plot with ggplot that shows the percentage of participants in each interest group by education level
plot <- ggplot(grouped_data, aes(x = educationLevel, y = percentage, fill = interestGroup)) +
  geom_col(position = "dodge") +
  labs(title = "Percentage of Participants in Each Interest Group by Education Level",
       x = "Education Level",
       y = "Percentage") +
  scale_fill_brewer(palette = "Set3") +
  theme()

# Making the plot interactive
ggplotly(plot)

From the above chart, we observe that interest group I has exceptionally high percentage of participants from low education level while interest group E has the exceptionally low percentage of participant from low education level residents. For other education levels, interest group participation, percentage of participation across interest groups varies shows less of huge difference. More information may be required to determine the reasons behind.

3.1.6 Mean joviality vs education

Show the code
# creates a string that displays the mean joviality score along with the standard error.
tooltip <- function(y, ymax, accuracy = .01) {
  mean <- scales::number(y, accuracy = accuracy)
  sem <- scales::number(ymax - y, accuracy = accuracy)
  paste("Mean jovility:", mean, "+/-", sem)
}
# Create a ggplot, where the x-axis represents education level and the y-axis represents joviality score.
gg_point <- ggplot(data=part_info, 
                   aes(x = educationLevel),
) +
  stat_summary(aes(y = joviality, 
                   tooltip = after_stat(  
                     tooltip(y, ymax))),  
    fun.data = "mean_se", 
    geom = GeomInteractiveCol,  
    fill = "light blue"
  ) +
  stat_summary(aes(y = joviality),
    fun.data = mean_se,
    geom = "errorbar", width = 0.2, size = 0.2
  )# Add the mean and standard error to the plot
# Create a interactive plot
girafe(ggobj = gg_point,
       width_svg = 8,
       height_svg = 8*0.618)

We can observe from the above chart that mean joviality increases with increasing education level.

Let’s perform a statistical test to determine if there is a difference in the mean joviality

Show the code
# Create the plot of joviality against education level 
ggbetweenstats(
  data = part_info,
  x = educationLevel,
  y = joviality,
  plot.type = "box",
  mean.plotting = TRUE,
  xlab = "Education Level",
  ylab = "Joviality"
)

At a significance level of 5%, since the p value is smaller than 0.05, we reject the null hypothesis that the mean joviality of resident with different education are the same. Therefore, we confirm that mean joviality is different across education levels.

3.1.7 Mean joviality vs household size

Show the code
# create a tooltip that shows the mean joviality value and standard error of the mean for each point on the plot
tooltip <- function(y, ymax, accuracy = .01) {
  mean <- scales::number(y, accuracy = accuracy)
  sem <- scales::number(ymax - y, accuracy = accuracy)
  paste("Mean jovility:", mean, "+/-", sem)
}

gg_point <- ggplot(data=part_info, 
                   aes(x = householdSize),
) +
  stat_summary(aes(y = joviality, 
                   tooltip = after_stat(  
                     tooltip(y, ymax))),  
    fun.data = "mean_se", 
    geom = GeomInteractiveCol,  
    fill = "light blue"
  ) +
  stat_summary(aes(y = joviality),
    fun.data = mean_se,
    geom = "errorbar", width = 0.2, size = 0.2
  )#Add both points and error bars to the plot 

# Create interactive graph
girafe(ggobj = gg_point,
       width_svg = 8,
       height_svg = 8*0.618)

We observe that mean joviality for single person household is highest display an decreasing trend with increasing number of members of the household. Household size of 3 seems to have the least joviality. A possible reason may be that stress from raising the children may have made them less jovial, this may be a potential area that authorities can examine (with more information) if they hope to raise the mean joviality of the group.

Let’s perform a statistical test to determine if there is a difference in the mean joviality

Show the code
# Create the plot of joviality against household size
ggbetweenstats(
  data = part_info,
  x = householdSize,
  y = joviality,
  plot.type = "box",
  mean.plotting = TRUE,
  xlab = "Household Size",
  ylab = "Joviality"
)

However, at a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean joviality of residents of different household size are the same.

3.1.8 Mean joviality vs have kids

Show the code
# calculate the mean and standard error of the mean.Gets the unique values of the haveKids column and  include in the tooltip.
tooltip <- function(y, ymax, accuracy = .01) {
  mean <- scales::number(y, accuracy = accuracy)
  sem <- scales::number(ymax - y, accuracy = accuracy)
  havekids <- unique(part_info$haveKids)
  paste("Have Kids:", havekids, "<br>",
        "Mean Joviality:", mean, "+/-", sem)
}

gg_point <- ggplot(data=part_info, 
                   aes(x = haveKids),
) +
#Calculate the mean and standard error of the mean for each group of participants who do or do not have kids. 
  stat_summary(aes(y = joviality, 
                   tooltip = after_stat(  
                     tooltip(y, ymax))),  
    fun.data = "mean_se", 
    geom = GeomInteractiveCol,  
    fill = "light blue"
  ) +
  stat_summary(aes(y = joviality),
    fun.data = mean_se,
    geom = "errorbar", width = 0.2, size = 0.2 
  ) +
  labs(
       x = "Have Kids",
       y = "Joviality")

# Creates the interative plot
girafe(ggobj = gg_point,
       width_svg = 8,
       height_svg = 8*0.618)

From the chart above, we observe that mean joviality is lower for residents who have kids, this correspond to our previous analysis of mean joviality vs household size where household with 3 residents has the lowest mean joviality which may be caused by additional stress of taking care of kids

Let’s perform a statistical test to determine if there is a difference in the mean joviality

Show the code
# Creates a plot of joviality against if resident have kids
ggbetweenstats(
  data = part_info,
  x = haveKids,
  y = joviality,
  plot.type = "box",
  pairwise.comparisons = TRUE,
  mean.plotting = TRUE,
  xlab = "Having Kids",
  ylab = "Joviality"
)

At a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean joviality of resident who have and do not have kids are the same. There is insufficient evidence to suggest that there is a difference in mean joviality.

3.1.9 Mean joviality vs age

Show the code
# calculates the mean joviality for each age group , creates a new data frame called age_joviality 
age_joviality <- part_info %>% 
  group_by(age) %>% 
  summarize(mean_joviality = mean(joviality))
# Creates a interactive plot
ggplotly(ggplot(age_joviality, aes(x = age, y = mean_joviality)) +
  geom_smooth(method = "lm", se = FALSE) # Adds linear regression line
  + geom_point() +
  labs(x = "Age", y = "Mean Joviality") +
  ggtitle("Mean Joviality against Age"))

From the point plot alone, we do not see a distinct trend with age and mean joviality except that resident of age 53 has the lowest mean joviality and resident of age 59 has the highest mean joviality. When we overlay a smoothed linear plot, we can see that mean joviality decreases with increasing age.

Let’s perform a statistical test to determine if there is a difference in the mean joviality

Show the code
# Creates a plot of mean joviality against age using the age_joviality dataframe above
ggscatterstats(
  data = age_joviality,
  x = age,
  y = mean_joviality,
  marginal = FALSE,
  )+labs(x = "Age", y = "Mean Joviality")

At a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean joviality of resident is different across age group

Further examination
Show the code
# Add a column to indicate outliers using IQR method
is_outlier <- function(x) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = TRUE)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  x < (qnt[1] - H) | x > (qnt[2] + H)
}

# Create data subset for age 53
age53_joviality <- subset(part_info, age == 53)

# Create outlier column using is_outlier function
age53_joviality$outlier <- is_outlier(age53_joviality$joviality)

# Create ggplot box plot with outlier points colored red
plot53 <- ggplot(data = age53_joviality, aes(x = 1, y = joviality)) +
  geom_boxplot() +
  geom_point(aes(x = jitter(1, factor = 0.3), y = joviality, color = outlier)) +
  scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +
  theme_bw() +
  ggtitle("Joviality at Age 53") +
  xlab("") +
  ylab("Joviality")

# Convert ggplot object to plotly
plotly_object <- ggplotly(plot53)


# Create data subset for age 59
age59_joviality <- subset(part_info, age == 59)

# Create outlier column using is_outlier function
age59_joviality$outlier <- is_outlier(age59_joviality$joviality)

# Create ggplot box plot with outlier points colored red
plot59 <- ggplot(data = age59_joviality, aes(x = 1, y = joviality)) +
  geom_boxplot() +
  geom_point(aes(x = jitter(1, factor = 0.3), y = joviality, color = outlier)) +
  scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +
  theme_bw() +
  ggtitle("Joviality at Age 59") +
  xlab("") +
  ylab("Joviality")
# subplot for both plots
subplot(plot53, plot59, nrows = 1, titleY = TRUE, titleX = TRUE, margin = 0.1 ) %>%
  layout(title = 'Further checking',
         plot_bgcolor='#e5ecf6', 
         xaxis = list( 
           zerolinecolor = '#ffff', 
           zerolinewidth = 2, 
           gridcolor = 'ffff'), 
         yaxis = list( 
           zerolinecolor = '#ffff', 
           zerolinewidth = 2, 
           gridcolor = 'ffff')) %>%
  layout(annotations = list(
    list(
      x = 0.25,  
      y = 1.0,  
      text = "Distribution of Age 53 Participants' Joviality",  
      xref = "paper",  
      yref = "paper",  
      xanchor = "center",  
      yanchor = "bottom",  
      showarrow = FALSE 
    ),
    list(
      x = 0.75,  
      y = 1.0,  
      text = "Distribution of Age 59 Participants' Joviality",  
      xref = "paper",  
      yref = "paper",  
      xanchor = "center",  
      yanchor = "bottom",  
      showarrow = FALSE 
    )
  ))

We can observe that joviality for age 53 residents are more concentrated to below 0.5 while joviality for age 59 residents are more evenly distributed across the axis. This confirms our expectation that age 53 residents may indeed be generally not jovial while for age 59 residents, there is no such general concentration of joviality observed.

3.2.0 Mean Joviality vs Interest group

Show the code
# Aggregate joviality by interest group
joviality_interest <- aggregate(joviality ~ interestGroup, data = part_info, mean)

# Plot mean joviality by interest group
ggplot(data = joviality_interest, aes(x = interestGroup, y = joviality)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = round(joviality, 2)), vjust = -0.5) +
  labs(x = "Interest Group", y = "Mean Joviality", 
       title = "Mean Joviality by Interest Group")

Interest group E provides the highest mean joviality. If officials are looking to improve joviality among residents, they may consider encouraging participation in interest group E through ways such as subsidizing group E membership, etc

Let’s perform a statistical test to determine if there is a difference in the mean joviality

Show the code
# Creates a plot of joviality against interest group
ggbetweenstats(
  data = part_info,
  x = interestGroup,
  y = joviality,
  plot.type = "box",
  mean.plotting = TRUE,
  xlab = "Interest Group",
  ylab = "Joviality")+
  scale_color_brewer(palette = "Set3")

At a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean joviality of resident who participate in different interest groups are the same.

Note

We use scale_color_brewer() as default palette only has 8 colors. We can define out own colors as well to be used in palette

3.2 Exploring financial data

Next we move on to explore variables in the financial data. Since every participant can have multiple entries. We will explore the data by grouping the entries according the participant’s Id and the category

3.21 Sum of residents expenditure by category

Show the code
# Aggregate financial data by participant
financial_data_agg <- finance %>%
  group_by(participantId,category) %>%
  summarize(total = sum(amount), .groups = "drop")

# Financial summary
expenses_summary <- financial_data_agg %>%
  group_by(category) %>%
  summarize(total = sum(total))

# Bar chart of expenses by category
expenses_plot <- ggplot(expenses_summary, aes(x = category, y = total)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Expense Category", y = "Total Amount Spent", title = "Expenses by Category") +
  theme_minimal()
# Creates the interactivity 
ggplotly(expenses_plot)

In the chart above we realize wage is counted as part of expenditure We will remove wage since it is not exactly an expense that we will be looking at.

Expenses by category (removing wage)

Show the code
# Aggregate financial data by participant
financial_data_agg <- finance %>%
   filter(category != "Wage") %>%
  group_by(participantId, category) %>%
  summarize(total = sum(amount), .groups = "drop")

# Financial summary
expenses_summary <- financial_data_agg %>%
  group_by(category) %>%
  summarize(total = sum(total))

# Bar chart of expenses by category
expenses_plot1 <- ggplot(expenses_summary, aes(x = category, y = total)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Expense Category", y = "Total Amount Spent", title = "Expenses by Category") +
  theme_minimal()
# Creates the interactivity
ggplotly(expenses_plot1)

From the chart above we realize that shelter accounts for the main category for expenditure. This is followed by recreation, food and education. Rent adjustment is on the negative end which may be a indication that the landlord in the city has overall lowered their rents.

3.2.2 Total wage against month

Next we will examine the wages over the months across the time stamp period.

Show the code
# Convert the timestamp column to a datetime object
finance$timestamp <- as.POSIXct(finance$timestamp)

# Create a line chart of wages per month
wages_per_month <- finance %>%
  filter(category == "Wage") %>%
  group_by(month = floor_date(timestamp, "month")) %>%
  summarize(total_wages = sum(amount)) %>%
  ggplot(aes(x = month, y = total_wages)) +
  geom_line() +
  labs(x = "Month", y = "Total Wages", title = "Total Wages against Month") +
  scale_x_datetime(date_labels = "%b %Y")

# Display the plot
print(wages_per_month)

From the above chart, we realize that in March, total wage is abnormally high. This may be an anomaly that that may require further information to examine. For now, we will retain it as it is.

3.2.3 Average Amount by Category over Time

Show the code
# Aggregate financial data by category and timestamp
financial_data_agg <- finance %>%
  filter(category != "Wage") %>%
  group_by(category, timestamp) %>%
  summarize(avg_amount = mean(amount))

# Line chart of average amount by timestamp, colored by category
line_chart <- ggplot(financial_data_agg, aes(x = timestamp, y = avg_amount, color = category)) +
  geom_line() +
  labs(x = "Timestamp", y = "Average Amount", title = "Average Amount by Category over Time") +
  theme_minimal()

# Display the chart
ggplotly(line_chart)

From the above chart, we observe that mapping average amount for category by days in the time stamp is not visually pleasing to see any patterns. But still if we zoom in, we can get some information from it.We can see that there are some major fluctuation in shelter amount and rent adjustment in march and April. Education amount is incurred on first day of the month and that if we zoom in. We can see that recreation and food demonstrates a regular pattern as shown in the pic below

Lets try to map above information in terms of months

3.2.4 Average Amount Spent by Category per Month

Show the code
# Extract month from timestamp
financial_data_agg <- finance %>%
  filter(category != "Wage") %>%
  mutate(month = format(timestamp, "%Y-%m")) %>%
  group_by(participantId, category, month) %>%
  summarize(total = sum(amount), .groups = "drop")

# Calculate average amount spent by category per month
category_month_avg <- financial_data_agg %>%
  group_by(category, month) %>%
  summarize(avg_amount = mean(total))

# Create bar chart
category_month_plot <- ggplot(category_month_avg, aes(x = month, y = avg_amount, fill = category)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Month", y = "Average Amount Spent", title = "Average Amount Spent by Category per Month") +
  theme(axis.text.x = element_text(angle = 45,
                                   vjust = 0.5,
                                   hjust = 0.5))
# Plot interactive chart
ggplotly(category_month_plot)

From the chart above we observe that the amount spent on average per category is indeed in the following order with education < food < recreation < shelter. We can also observe some anomalies in March and April. There is an exceeding large expenditure on shelter and recreation in march and in April the rent adjustment increased exceptionally as well. The expenditure in shelter may have being transferred to increase rent adjustment.

Note

Using theme minimal will result in overlapping of x-axis. I use theme here so that angle of x-axis label can be adjusted 45 degrees

Note

Since not everyone in the city is a landlord / tenant. This chart only serve as a benchmark of average amount city resident spent.

3.3 Merged visualization

Now lets see some visualizations after we merge these 2 files together using the code below

Show the code
financial_journal <- finance %>%
  mutate(date = as.Date(timestamp))

# Aggregate financial data by participantId and category
agg_financial <- financial_journal %>%
  group_by(participantId, category) %>%
  summarize(total_spent = sum(amount), .groups = "drop")

# Merge demographic data with aggregated financial data
merged_data <- part_info %>%
  left_join(agg_financial, by = "participantId")

We can see the residents’ total expenditures in each category and their respective wage using the below code.

Show the code
DT::datatable(merged_data[, c("participantId", "category", "total_spent")], class = 'compact')

Lets move on to see some EDAs

3.3.1 Wage agaist joviality

Show the code
# Subset merged_data to only include rows where category is "wage"
wage_data <- merged_data %>% filter(category == "Wage")
# Creates the interactive plot
ggplotly(ggplot(wage_data, aes(x = total_spent, y = joviality)) +
  geom_smooth() +
  labs(title = "Wage vs. Joviality", x = "Wage", y = "Joviality")+ geom_point())

Using the smooth trend line, we can observe that joviality generally decrease with increasing wage. This may be because a higher wage may mean more responsibility at work and hence greater amount of stress which lead to lower joviality

Let’s perform a statistical test to determine if there is a correlation between joviality and wage

Show the code
# Creates a plot of joviality against wage
ggscatterstats(
  data = wage_data,
  x = total_spent,
  y = joviality,
  marginal = FALSE,
  ) + labs(x = "Wage", y = "Joviality")

At 5% significance level, since the p value is smaller than 0.05, we reject the null hypothesis that there is no correlation between joviality and wage

3.3.2 Wage against education level

Show the code
# Creates a plot wage against education level using "wage_data" defined above
ggplot(wage_data, aes(x = educationLevel, y = total_spent)) +
  geom_point() +
  geom_boxplot(alpha = 0.5, width = 0.2, outlier.size = 1) +
  labs(x = "Education Level", y = "Wage")

From the chart above we can observe increasing education level of residents has increasing level of wage

3.3.3 Wage against interest group

Show the code
# Creates a plot of wage across interest group using the "wage_data" defined above
ggplot(wage_data, aes(x = interestGroup, y = total_spent)) +
  geom_point(alpha = 0.5) +
  geom_boxplot(alpha = 0.5, width = 0.2, outlier.size = 1) +
  labs(x = "Interest Group", y = "Wage")

From the above chart, wage distribution across interest group seems to be well spread. But Interest group I seems to have attracted on one of the largest wage resident in the city.

Let’s perform a statistical test to determine if there is a correlation between joviality and wage

# Construct a linear regression model to the wage_data using lm() function.The dependent variable is total_spent, and the independent variables are interestGroup and educationLevel.
model <- lm(total_spent ~ interestGroup + educationLevel, data = wage_data)
model

Call:
lm(formula = total_spent ~ interestGroup + educationLevel, data = wage_data)

Coefficients:
                      (Intercept)                     interestGroupB  
                         35276.37                           -2849.09  
                   interestGroupC                     interestGroupD  
                         -2936.16                             583.35  
                   interestGroupE                     interestGroupF  
                         -3019.36                           -5420.24  
                   interestGroupG                     interestGroupH  
                         -4753.63                            -993.26  
                   interestGroupI                     interestGroupJ  
                            58.03                           -2917.00  
educationLevelHighSchoolOrCollege            educationLevelBachelors  
                          5816.44                           28023.60  
           educationLevelGraduate  
                         42627.88  
# Uisng check_collinearity from the performance package to check regression model for multicollinearity by calculating the variance inflation factor (VIF)
check_collinearity(model)
# Check for Multicollinearity

Low Correlation

           Term  VIF   VIF 95% CI Increased SE Tolerance Tolerance 95% CI
  interestGroup 1.04 [1.01, 1.22]         1.02      0.96     [0.82, 0.99]
 educationLevel 1.04 [1.01, 1.22]         1.02      0.96     [0.82, 0.99]
Show the code
check_c <- check_collinearity(model)
# Plots the chart of VIF against education level and interest group
plot(check_c)

From the results, we can determine that there is a low correlation between education level and interest group attended against wage

3.3.4 Percentage of spending within each category for each group

Show the code
# Calculates the total spending by spending category and household type, excluding the "Wage", and calculates the percentage of total spending for each group. This data frame is then save into "spending_by_kids"
spending_by_kids <- merged_data %>%
  filter(category != "Wage") %>%
  group_by(haveKids, category) %>%
  summarise(total_spending = sum(total_spent)) %>%
  mutate(percentage = total_spending/sum(total_spending)*100)

# Convert haveKids to a factor for easier plotting
spending_by_kids$haveKids <- as.factor(spending_by_kids$haveKids)

# Saving the plot to "spending_by_kids_plot"
spending_by_kids_plot <- ggplot(spending_by_kids, aes(x = haveKids, y = percentage, fill = category)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_viridis_d() +
  theme_minimal() +
  labs(title = "Percentage of Spending by Category and Household Type",
       x = "Household Type",
       y = "Percentage of Spending") +
  guides(fill = guide_legend(title = "Spending Category"))

# Display plot
ggplotly(spending_by_kids_plot)

We can observe that those who do not spend on education at all have no kids while household which do have kids has greater spending on shelter and food and lower on recreation.

3.3.5 Spending category by education level

Show the code
# Removes wage from merged_data and saves new data frame as spending_data
spending_data <- merged_data %>% filter(category != "Wage")

# Create the plot of spending category against education level
ggplot(spending_data, aes(x = category, y = total_spent, fill = category)) +
  geom_col(position = "dodge") +
  facet_wrap(~educationLevel, ncol = 2) + # Wrap the 4 charts of different education into 1
  scale_fill_discrete(name = "Spending Category") +
  labs(x = "Spending Category", y = "Amount")

From the chart we can observe that, with the exception of rent adjustment, spending pattern by category across different education level is generally the same. High school educated and bachelors educated has the highest positive rent adjustment which may indicate that there are more landlords from these 2 education backgrounds

3.3.6 Recreation spending vs joviality

Show the code
# Keeping only entry with recreation as category in merged_data and save the new data frame as recreation_data
recreation_data <- merged_data %>%
  filter(category == "Recreation")

# Create scatter plot of joviality vs recreation spending
ggplot(recreation_data, aes(x = total_spent, y = joviality)) +
  geom_smooth() +
  labs(x = "Recreation Spending", y = "Joviality") + geom_point()

From the above observation, it seems that increasing spending would increase joviality until a saturation is reached at around $10000. Note that the chart is plotted on a negative x-axis as spending is accounted for as negative value.

Let’s perform a statistical test to determine if there is a correlation bewteen joviality and recreation spending

Show the code
# Creates a plot of joviality against recreation spending
ggscatterstats(
  data = recreation_data,
  x = total_spent,
  y = joviality,
  marginal = FALSE,
  ) + labs(x = "Recreation", y = "Joviality")

At 5% significance level, since the p value is smaller than 0.05, we reject the null hypothesis that there is no correlation between joviality and recreation expenses

4 Conclusion

In conclusion, I believe that the authority there are a few points in which the they can look at when allocating the grant.

  • Greater funds should be channeled to support the education of residents given that it is the lowest expenditure. Currently education is only observed in residents who has kids and I assume that this expenditure is on the kids only. Perhaps the authorities can look at sponsoring adults continuous learning as well.

  • Mean joviality increases with increase increasing education and that this may be a reason to further support fundings for education purposes.

  • Joviality has an inverse relationship with wage and the manpower authority should perhaps examine the current laws and manpower landscape to determine if employees are properly treated in workplace

  • The mean age of residents having kids is 38 and the percentage of residents who have kids decreases as their education level increase. Targeted family development measures should be put in place if authorities seeks to reduce this age and encourage fertility rate among certain education group level

  • Shelter accounts for the largest expenditure in the category and residents who have kids spends on a larger percentage for shelter. Since shelter is a fundamental necessity, the authorities can perhaps consider rental subsidies and support for qualified residents.

  • Increasing recreation expenses improves joviality to a certain extent. If the city want to improve the happiness of the residents, perhaps they can consider measures such as free tours to parks,etc

To sum up, I have examined some variables and their correlation to each another. I have also provided some areas in which the authority can look at to disburse their city renewal grant and revitalize the community. However, as correlation does not equal causation, the above only serves as a reference to which the authority can look at. Further examination with more in dept data are required to understand the community and solve the problems more efficiently .